Overview

This lesson will guide students through the process of merging two messy dataframes by a common key using merge.

Prerequisites

Teaching Objectives

By the end of this lecture, students should be able to:

Roadmap

  1. Example introduction.
  2. Data exploration and problem formulation.
  3. Merge toolkit.
  4. Example problem.
  5. Break and reflection.
  6. Real-world application.

Scenario

!!!

Our gathering of famous duos has been beset by data goblins, resulting in all of our duos being separated! They have gathered together in two locations (dataframes), the california and massachusetts. Every duo wants to find their way back home, and to reunite with their opposite.

Practically, we want to use the data provided to produce a single dataframe, that has one row for each duo pair, one column for each member, their associated color, and hometown. And example of our desired output can be seen here.

A dataframe with 5 columns titled member 1, member 1 color, member 2, member 2 color, and hometown

Desired Output Example

Workspace Preparation

Please run the following code to download our class data, and prepare it for the lesson.

# Load in student survey csv
.raw_survey = read.csv("https://docs.google.com/spreadsheets/d/1E8ME8ldv8Pv4-saL-ZYHJckmVPNgcNd53LasYiW0jwU/export?format=csv&usp=sharing", header = TRUE, stringsAsFactors = FALSE)

# Let loose the data gremlins
# Don't look at this file until after lecture! Here there be spoilers.
source("https://raw.githubusercontent.com/Epsian/table_join_lecture/main/src/1_split_duos.R")

Exploring the Data

Our duos were split, and individuals were cast off into to different locations (dataframes), named california and massachusetts. Let’s start by looking at them. Each dataframe has 4 columns: the name of the individual, a color, the name of their alter (pairing), and their hometown.

Starting with the california:

head(california, 10)
##                 name  color                alter        hometown
## 1           Sherlock   Blue                 <NA>          LONDON
## 2             Batman    Red                Robin          GOTHAM
## 3              Mandy    Red                 <NA>          EVILLE
## 4               Aang Orange               Katara   REPUBLIC CITY
## 5               Link  Green                 <NA>     CASTLE TOWN
## 6           Han Solo   Blue                 <NA>        CORELLIA
## 7               Kiki    Red                 <NA>          KORIKO
## 8  Vash the Stampede    Red Nicholas D. Wolfwood A DESERT PLANET
## 9      Konrad Franco    Red    Konstanze Conradt      SANTA ROSA
## 10        Scooby Doo Orange               Shaggy      COOLSVILLE
str(california)
## 'data.frame':    11 obs. of  4 variables:
##  $ name    : chr  "Sherlock" "Batman" "Mandy" "Aang" ...
##  $ color   : chr  "Blue" "Red" "Red" "Orange" ...
##  $ alter   : chr  NA "Robin" NA "Katara" ...
##  $ hometown: chr  "LONDON" "GOTHAM" "EVILLE" "REPUBLIC CITY" ...

And then the massachusetts:

head(massachusetts, 10)
##                    name  color             alter        hometown
## 1  Nicholas D. Wolfwood   Blue Vash the Stampede A_Desert_Planet
## 2                Watson Indigo          Sherlock     London_(13)
## 3                  JiJi Indigo              <NA>          Koriko
## 4                 Billy   Blue              <NA>          Eville
## 5                Shaggy  Green        Scooby Doo      Coolsville
## 6             Chewbacca Orange              <NA>        Corellia
## 7          Harry Potter    Red  Hermione Granger        HOGWARTS
## 8     Konstanze Conradt Orange     Konrad Franco      Santa_Rosa
## 9      Hermione Granger Yellow              <NA>        Hogwarts
## 10                Robin Yellow            Batman          Gotham
str(massachusetts)
## 'data.frame':    15 obs. of  4 variables:
##  $ name    : chr  "Nicholas D. Wolfwood" "Watson" "JiJi" "Billy" ...
##  $ color   : chr  "Blue" "Indigo" "Indigo" "Blue" ...
##  $ alter   : chr  "Vash the Stampede" "Sherlock" NA NA ...
##  $ hometown: chr  "A_Desert_Planet" "London_(13)" "Koriko" "Eville" ...

Checkpoint

  1. What do you notice about these dataframes? Does anything surprise you?
  2. Considering our desired output, do you have any ideas regarding how we can join these dataframes?
  3. What, if any, manipulations do you think we will need to perform before we can try and merge the data?

Joining Data in the Abstract

Our eventual goal is to join the two dataframes above, and have a single row for each duo pair, one column for each member, and one column for their hometown. How can we get there? Let’s look at a few possibilities.

If you would like to test these dataframes within R yourself, you can run this code. The results will be identical to the figures.

source("https://raw.githubusercontent.com/Epsian/table_join_lecture/main/src/2_merge_examples.R")

Join Vertically using rbind

If you have two dataframes and want to combine them, you need to decide what direction you want to combine them in. If you would like to combine them vertically, or stack the rows on top of each other, you can use rbind or row bind. rbind is helpful if you have two dataframes with the same columns, and you want to combine the cases (rows). However, it will not work if the columns are different. rbind accepts an arbitrary number of dataframes, here we use two: upper_rbind_df and lower_rbind_df.

outcome = rbind(upper_rbind_df, lower_rbind_df) 
Two dataframes that merge well bertically using the rbind function

rbind example 1

Join Horizontally using cbind

If you would like to combine two dataframes horizontally, so that you add more columns on to a dataframe, you can use cbind or column bind. The cbind function takes an arbitrary number of dataframes as it’s arguments. We’ll provide just our two here, but try repeating one multiple times if you would like to see what happens.

outcome = cbind(left_cbind_df, right_cbind_df) 
Two dataframes that merge well horizontally using the cbind function

cbind example 1

We can see out new outcome dataframe was created as expected. This method can work, but you must be very confident about the structure of your data. For example, in the following figure, what issues do you expect to run into with this method?

Two dataframes with misaligned data

cbind example 2

Joins using merge

cbind may work in some situations, but is very “brittle” – or easy to break – with any changes to the data. A more reliable method of merging makes use of a common key between two sources of data. The key can be anything, like a numerical ID or a string, but they must always be unique. Joins that use the key method will be familiar to anyone who has used SQL in the past, as they follow the same naming convention as in that language.

Inner Join

An inner join tries to join two sets of data using a common key, keeping only the data that exists in both of the original data sets. This is commonly illustrated using a venn diagram similar to the one below. Only the area highlighted with green will be included in the output.

A venn diagram with circles labeled data 1 and data 2. The intersection of them is highlighted with the text 'inner join'

Inner Join - Venn Diagram

In the context of our data, it might look something like the following if we use hometown as our key. We can use the following code to do an inner merge using the merge function. The merge function in R requires two arguments, x and y, which are the dataframes we would like to merge. We will be using left_merge_df and right_merge_df, and asking merge to use hometown as the key to merge by.

outcome = merge(x = left_merge_df, y = right_merge_df, by = "hometown") 
Two dataframes being joined by common hometowns, removing cases where there is no hometown match

Inner Join - Example Data

Everything seems in order. We can see that when using an inner join, cases where both dataframes have a single row with our key will be matched and joined into a single output dataframe. However, if there are rows in either data frame without a match, those rows will be dropped from our data.

Outer Join

An Outer Join is the opposite of an inner join. Rather than just looking for those rows which have a key in common, it will join every row, regardless of the keys, inserting blank values where there is no match. You can get a sense of this merge with the venn diagram below, where the green indicates good matches, while the orange indicates partial matches.

A venn diagram with circles labeled data 1 and data 2. The intersection of them is highlighted in green, while the edges are highlighted in orange

Outer Join - Venn Diagram

In the context of our data, an outer join may look something like the following, again using hometown as our key. The code for an outer join is similar to an inner join, with one addition. We will want to specify that all = TRUE, essentially saying we want everything included.

outcome = merge(x = left_merge_df, y = right_merge_df, by = “hometown”, all = TRUE)
Two dataframes being joined by common hometowns, adding rows for cases with no matches, but introducing NAs

Outer Join - Example Data

In this example, we see that an outer join will keep all available cases, but we introduce NA values into the resulting dataframe where no match could be made. This may or may not be an issue depending on what your next steps are.

Left Join

A left join allows you to pick one of the two dataframes you are joining and prioritize it. It essentially takes all of the cases in the dataframe on the ‘left’ side, and searches in ‘right’ dataframe to join what it can. Keeping with the venn diagram representations, it would look like the following:

A venn diagram with circles labeled data 1 and data 2. All of the data 1 circle is highlighted, but only the part od data 2 that intersects data 1 is highlighted

Left Join - Venn Diagram

Using our example data, a left join would look like this. The R code is similar, to the previous examples, but we specify that instead of all = TRUE, we just want all.x = TRUE. Recall that in the function call, x is the first dataframe, or the “left” one.

outcome = merge(x = left_merge_df, y = right_merge_df, by = “hometown”, all.x = TRUE) 
Two dataframes being joined by common hometowns, keeping all of the left dataframe data, and adding the right dataframe data when available

Left Join - Example Data

We can see that while all of the data from our ‘left’ side is preserved, unmatched data on the ‘right’ is discarded. You can also technically do a right join, which will do the same thing with sides reversed, but moving your prioritized data set to the left is more common.


Checkpoint

Let’s clean up our environment a bit using the following code:

rm(e1_cbind, e2_cbind, e3_cbind, e1_key, e2_key, e3_inner, e3_outer, e3_left)
  1. What is the difference between a column bind and a key based join?
  2. What is the difference between an inner join, outer join, and left join?

Key Selection

Now that we have a few possible tools, we can start considering how we are going to join our duos back together. Let’s glance back at our california and massachusetts dataframes to get an idea of what we’re working with.

head(california, 10)
##                 name  color                alter        hometown
## 1           Sherlock   Blue                 <NA>          LONDON
## 2             Batman    Red                Robin          GOTHAM
## 3              Mandy    Red                 <NA>          EVILLE
## 4               Aang Orange               Katara   REPUBLIC CITY
## 5               Link  Green                 <NA>     CASTLE TOWN
## 6           Han Solo   Blue                 <NA>        CORELLIA
## 7               Kiki    Red                 <NA>          KORIKO
## 8  Vash the Stampede    Red Nicholas D. Wolfwood A DESERT PLANET
## 9      Konrad Franco    Red    Konstanze Conradt      SANTA ROSA
## 10        Scooby Doo Orange               Shaggy      COOLSVILLE
head(massachusetts, 10)
##                    name  color             alter        hometown
## 1  Nicholas D. Wolfwood   Blue Vash the Stampede A_Desert_Planet
## 2                Watson Indigo          Sherlock     London_(13)
## 3                  JiJi Indigo              <NA>          Koriko
## 4                 Billy   Blue              <NA>          Eville
## 5                Shaggy  Green        Scooby Doo      Coolsville
## 6             Chewbacca Orange              <NA>        Corellia
## 7          Harry Potter    Red  Hermione Granger        HOGWARTS
## 8     Konstanze Conradt Orange     Konrad Franco      Santa_Rosa
## 9      Hermione Granger Yellow              <NA>        Hogwarts
## 10                Robin Yellow            Batman          Gotham

In order to join these dataframes and reunite our duos, we need to choose some method to match on. For this activity, we will be using one of the key based methods we covered above. However, data is rarely in a clean and usable state from the start. Often the bulk of your time will be spent preparing the data to do the actual task you want to accomplish.


Checkpoint

  1. What column do you think could serve as a key for us to join on?
  2. What if any manipulations will you need to do for the key to be usable?

Spoiler Space

Don’t scroll past here until you’ve answered the above questions!
















Key Creation

For our example, we will be using the hometown as out matching key; we ultimately want all of our duos to get home! However, the data gremlins have made this task more difficult, as the hometowns have various noise in them that prevents us from using them as a key. We’re going to have to repair the damage before we can reunite our duos.

We need to make our hometowns exact between out duos to use them as a key. To do this, we will need to standardize the capitalization, and the space deliminators in the hometown column. Let’s take care of the capitalization first.

While we could spend time making sure all of our hometowns are properly capitalized, it isn’t really necessary in this application; we just need them to match. An easy way to do that is to set everything to upper or lower case. Let’s use lower case here, making use of the tolower function in R. It will convert any character vector into all lower case.

First for the california:

california$hometown = tolower(california$hometown)

And then the massachusetts:

massachusetts$hometown = tolower(massachusetts$hometown)

Let’s pause to look at our work …

head(california, 10)
##                 name  color                alter        hometown
## 1           Sherlock   Blue                 <NA>          london
## 2             Batman    Red                Robin          gotham
## 3              Mandy    Red                 <NA>          eville
## 4               Aang Orange               Katara   republic city
## 5               Link  Green                 <NA>     castle town
## 6           Han Solo   Blue                 <NA>        corellia
## 7               Kiki    Red                 <NA>          koriko
## 8  Vash the Stampede    Red Nicholas D. Wolfwood a desert planet
## 9      Konrad Franco    Red    Konstanze Conradt      santa rosa
## 10        Scooby Doo Orange               Shaggy      coolsville

Looks good. Now let’s deal with the spacing. If you look at our dataframes and compare across them, you’ll notice some hometowns have underscores in them. This seems to only happen when there would be a space. We can use that rule to replace all the underscores with a space instead. To do this, we’ll use a gsub function, or “global substitution.” It will look over all of the strings, and globally replace a pattern we specify with a substitution. In this case, we want to replace all "_" with " ".

# gsub(Pattern, replacement, X)
# pattern is what we want to replace
# replacement is what we want to replace the pattern
# X is the vector of character strings we want to run this replacement on
california$hometown = gsub(pattern = "_", replacement =  " ", x = california$hometown)

And now the massachusetts:

massachusetts$hometown = gsub(pattern = "_", replacement = " ", x = massachusetts$hometown)

Let’s check our work …

head(california, 10)
##                 name  color                alter        hometown
## 1           Sherlock   Blue                 <NA>          london
## 2             Batman    Red                Robin          gotham
## 3              Mandy    Red                 <NA>          eville
## 4               Aang Orange               Katara   republic city
## 5               Link  Green                 <NA>     castle town
## 6           Han Solo   Blue                 <NA>        corellia
## 7               Kiki    Red                 <NA>          koriko
## 8  Vash the Stampede    Red Nicholas D. Wolfwood a desert planet
## 9      Konrad Franco    Red    Konstanze Conradt      santa rosa
## 10        Scooby Doo Orange               Shaggy      coolsville

Looking good! It seems we have a consistent key. Now for the ultimate task of reuniting our duos.

Format and Merge

Unfortunately, our duos were not split evenly. Some duos both got sent to the same place, while others were sent to different places. As with many data science problems, there are a number of ways to overcome this, and no “best” answer.


Checkpoint

  1. How would you prepare our dataframes for a clean merge, now that we have a good key?

Spoiler Space

Don’t scroll past here until you’ve answered the above questions!
















I’ll walk you through one method, but stress that this is not an “optimal” answer, it is just one of many. For my solution, I will aggregate all of the individuals into one dataframe, and then split them so that each new dataframe has only one instance of each hometown.

First, I’ll combine all my individuals. You can add on rows to a dataframe using rbind much the same way you can add on columns using cbind in R. However, rbind is typically less fraught, as rows are typically conceptualized as individual and independent of each other, unlike columns which typically show data about rows. rbind essentially stacks two dataframes on top of each other, or appends the rows of one dataframe to another.

# combine all out individuals into one dataframe
individuals = rbind(massachusetts, california)

Now all of our individuals are in a single dataframe. Our next task is to split them evenly, so one member of each duo, by hometown, is in a separate dataframe. For this we can make use of the duplicated function. duplicated checks if a specific value has appeared already within a vector. For example, if you had a vector of c("A", "B", "B", "C", "D", "C"), the second B and second C would be considered duplicated, and thus output TRUE while all other values would output FALSE.

In our case, if we run duplicated over the hometown column for our individuals, and split the TRUE into one datafarme and FALSE into another, we should end up with two dataframes, each containing one individual from a hometown. Let’s give it a try. First, let’s see what it looks like when we run duplicated. We’ll assign it to our individuals dataframe so it is easier to see.

# run duplicated over hometowns and assign as a new column in our individuals dataframe
individuals$duplicated = duplicated(individuals$hometown)

# show the dataframe
individuals
##                    name  color                alter        hometown duplicated
## 1  Nicholas D. Wolfwood   Blue    Vash the Stampede a desert planet      FALSE
## 2                Watson Indigo             Sherlock     london (13)      FALSE
## 3                  JiJi Indigo                 <NA>          koriko      FALSE
## 4                 Billy   Blue                 <NA>          eville      FALSE
## 5                Shaggy  Green           Scooby Doo      coolsville      FALSE
## 6             Chewbacca Orange                 <NA>        corellia      FALSE
## 7          Harry Potter    Red     Hermione Granger        hogwarts      FALSE
## 8     Konstanze Conradt Orange        Konrad Franco      santa rosa      FALSE
## 9      Hermione Granger Yellow                 <NA>        hogwarts       TRUE
## 10                Robin Yellow               Batman          gotham      FALSE
## 11               Katara   Blue                 Aang   republic city      FALSE
## 12               Watson Indigo                 <NA>          london      FALSE
## 13                Zelda Violet                 <NA>     castle town      FALSE
## 14      Lorelai Gilmore Orange         Rory Gilmore    stars hollow      FALSE
## 15         Rory Gilmore Yellow                 <NA>    stars hollow       TRUE
## 16             Sherlock   Blue                 <NA>          london       TRUE
## 17               Batman    Red                Robin          gotham       TRUE
## 18                Mandy    Red                 <NA>          eville       TRUE
## 19                 Aang Orange               Katara   republic city       TRUE
## 20                 Link  Green                 <NA>     castle town       TRUE
## 21             Han Solo   Blue                 <NA>        corellia       TRUE
## 22                 Kiki    Red                 <NA>          koriko       TRUE
## 23    Vash the Stampede    Red Nicholas D. Wolfwood a desert planet       TRUE
## 24        Konrad Franco    Red    Konstanze Conradt      santa rosa       TRUE
## 25           Scooby Doo Orange               Shaggy      coolsville       TRUE
## 26             Sherlock   Blue               Watson     london (13)       TRUE

Looks good! Now we can use that new column to evenly split our dataframe, and them cleanly pair our duos back together! First let’s evenly split them:

# assign to a new duo1 dataframe, the contents of dataframe "individuals"
# SUCH THAT
# the rows whose value in column `individuals$duplicated` is equal to TRUE
# and give me all columns
duo1 = individuals[individuals$duplicated == TRUE, ]

# assign to a new duo1 dataframe, the contents of dataframe "individuals"
# SUCH THAT
# the rows whose value in column `individuals$duplicated` is equal to TRUE
# and give me all columns
duo2 = individuals[individuals$duplicated == FALSE, ]

Alright, now we have two dataframes, each containing one member of a duo. Time to finally merge them back together! We can do this using the merge function we learned about previously. Let’s try an inner merge.

# assign to a new reunion dataframe
# the result of a inner merge between dataframes `duo1` and `duo2`
# using column "hometown" as the key
reunion = merge(duo1, duo2, by = "hometown")

# show results
reunion
##           hometown            name.x color.x              alter.x duplicated.x
## 1  a desert planet Vash the Stampede     Red Nicholas D. Wolfwood         TRUE
## 2      castle town              Link   Green                 <NA>         TRUE
## 3       coolsville        Scooby Doo  Orange               Shaggy         TRUE
## 4         corellia          Han Solo    Blue                 <NA>         TRUE
## 5           eville             Mandy     Red                 <NA>         TRUE
## 6           gotham            Batman     Red                Robin         TRUE
## 7         hogwarts  Hermione Granger  Yellow                 <NA>         TRUE
## 8           koriko              Kiki     Red                 <NA>         TRUE
## 9           london          Sherlock    Blue                 <NA>         TRUE
## 10     london (13)          Sherlock    Blue               Watson         TRUE
## 11   republic city              Aang  Orange               Katara         TRUE
## 12      santa rosa     Konrad Franco     Red    Konstanze Conradt         TRUE
## 13    stars hollow      Rory Gilmore  Yellow                 <NA>         TRUE
##                  name.y color.y           alter.y duplicated.y
## 1  Nicholas D. Wolfwood    Blue Vash the Stampede        FALSE
## 2                 Zelda  Violet              <NA>        FALSE
## 3                Shaggy   Green        Scooby Doo        FALSE
## 4             Chewbacca  Orange              <NA>        FALSE
## 5                 Billy    Blue              <NA>        FALSE
## 6                 Robin  Yellow            Batman        FALSE
## 7          Harry Potter     Red  Hermione Granger        FALSE
## 8                  JiJi  Indigo              <NA>        FALSE
## 9                Watson  Indigo              <NA>        FALSE
## 10               Watson  Indigo          Sherlock        FALSE
## 11               Katara    Blue              Aang        FALSE
## 12    Konstanze Conradt  Orange     Konrad Franco        FALSE
## 13      Lorelai Gilmore  Orange      Rory Gilmore        FALSE

A bit messy still, but we did it! Each duo has been reunited in one row. Now we can clean up the results a bit, and arrive at our desired outcome. You’ll notice some column names now have a “.x” or “.y” at the end. This is because in the original duo dataframes, the columns had the same name. Since two columns can’t have the same name once we join them in a single dataframe, the merge function add these suffixes to denote which column came from which dataframe. The columns from the first dataframe in the merge function always gets the “.x” ending, while the second gets the “.y” ending.

# first let's keep only the columns we want
# assign to new dataframe `outcome` the contents of `reunion`
# SUCH THAT
# all rows
# and the columns are named 
outcome = reunion[ , c("name.x", "color.x", "name.y", "color.y", "hometown")]

Lastly, we can rename our columns.

# rename columns of dataframe `outcome`
colnames(outcome) = c("mem_1", "mem_1_color", "mem_2", "mem_2_color", "hometown")

# show results
head(outcome, 10)
##                mem_1 mem_1_color                mem_2 mem_2_color
## 1  Vash the Stampede         Red Nicholas D. Wolfwood        Blue
## 2               Link       Green                Zelda      Violet
## 3         Scooby Doo      Orange               Shaggy       Green
## 4           Han Solo        Blue            Chewbacca      Orange
## 5              Mandy         Red                Billy        Blue
## 6             Batman         Red                Robin      Yellow
## 7   Hermione Granger      Yellow         Harry Potter         Red
## 8               Kiki         Red                 JiJi      Indigo
## 9           Sherlock        Blue               Watson      Indigo
## 10          Sherlock        Blue               Watson      Indigo
##           hometown
## 1  a desert planet
## 2      castle town
## 3       coolsville
## 4         corellia
## 5           eville
## 6           gotham
## 7         hogwarts
## 8           koriko
## 9           london
## 10     london (13)

Checkpoint

  1. Is there anything from the above section you still have questions about?
  2. Now that we’ve gotten to the outcome, would you have tried anything differently?

Real-World Application

That was hopefully a fun spreadsheet-based adventure, and I’m glad it had a happy ending. But how can we use what we’ve learned in the real world? The inspiration for this lesson came from a task I needed to accomplish for a student organization I helped run, called Hack 4 California. The lab would meet weekly to work on civic data projects for our community, from small projects like helping the campus food pantry with usage statistics, and to full-fledged examinations of government data regarding exposure to hazardous materials in prisons with collaborators at UCLA.

To continue learning from the experience, my co-lead and I wanted to calculate attendance, and reach out to everyone who had participated to ask them what they thought! However, we had an issue. Because of the pandemic, half of our attendance data was in PDFs of excel sheets, and the other half was in zoom attendance logs. In total we had two PDFs, and 28 Zoom logs. You can see versions of these files with fake names below.

A PDF of a spreadsheet with dates as columns containing names

PDF Attendance Data

A spreadsheet log of of zoom attendance with names, times, and metadata

Zoom Log Attendance Data

The data formats are not ideal (Why PDFs of an excel sheet?), but such is the case with most projects. You can import versions of these dataframes to work with using the following code:

source("https://raw.githubusercontent.com/Epsian/table_join_lecture/main/src/3_real_examples.R")

Data Challenge

We wanted to create a single dataframe that contained complete attendance data, so that we could see how many times each person attended. Using the skills you’ve learned in this lecture and the course so far, create a single dataframe which contains a row for each person, and columns containing their name, a count of how many times they attended, and their email if available.

Tips

  1. Create a mock up of your desired output, work towards that goal.
  2. What information do you need to keep from the raw data?
  3. What do you think you can use to merge on? What obstacles are there to doing so?
  4. How will you need to format the data so you can merge? What are the rows and columns?